CS 7324 Lab 2

Francesco Trozzi (47779944) - George Sammit (04010135) - Megan Simons (46334773)

Business Understanding

Overview

The dataset (https://www.kaggle.com/iluvchicken/cheetah-jaguar-and-tige) presents 4,000 400x400 color images of cheetahs, jaguars, tigers, and hyenas. There are 1,000 images of each type of animal. These images are taken in various settings, at different angles, with various profiles, and do not necessarily expose the entire animal. They also include pups, males, and females. These aspects may be seen in the following examples from the data set. This data was published to Kaggle in June 2020 as part of Professor Hyoseok Hwang's Deep Learning course. In fact, it appears to be a part of a classification exercise for his students.

Cheetahs
Jaguar
Tiger
Hyena

Business Case

The ability to classify animals in the wild has several potential applications. One such application could be facilitated by the use of outdoor cameras. For example, so-called “urban” coyotes pose real threats to pets in Texas (https://texashillcountry.com/coyotes-texas-dangerous/). Strategically mounting cameras that can differentiate a coyote from a dog and alert appropriately could be valuable to local animal control agencies. In this application, misclassifications would pose a mere nuisance to the local agencies, so an accuracy rate of over 90% seems appropriate. This application represents a business-to-consumer model. Another application, useful to researchers, could be tracking the movements of animals without having to capture and tag them, i.e., through recognition of specific animals. A first step in such a process would involve classification. However, this application would likely require a higher accuracy rate. A final business case would be to monitor preserves or reservations of endangered species with drones. If a drone can identify the desired species in an image it takes, it could track the animal autonomously and report back to scientists, greatly reducing their data collection work. In this case, such information would be used to track the increase or decrease of a certain animal species, providing clues about the type of action a wildlife preservation agency should take or updating the conservation status of the animal. The latter applications are business-to-business (or non-profit) models.

Cheetahs and jaguars and tigers and hyenas, oh my!

The chart below depicts the families of the animals in this cohort:
family_tree.png

  • Cheetahs: One of the most prominent characteristics of cheetahs is their solid black spots. Another interesting feature is the dark “tear marks” running down from the eyes. The cheetah may be differentiated within this cohort by its small round head and long tail. Also note that the last third of its tail is striped, not spotted.
  • Jaguars: Jaguars also possess a distinctive pattern, one similar to those of leopards and cheetahs. A jaguar may be thought of as a more muscular version of a cheetah. Note the more elongated face. Also note the jaggedness of its "spots." Jaguar spots are called "rosettes" because of their resemblance to the petals of a rose viewed from above.
  • Tigers: Arguably, the tiger is the most recognizable of the large cats due to the distinctive striping of its fur. Note that the striping is not equally prominent in all populations, as illustrated below. The striping continues onto its large round head as well as its long tail. Tigers are the largest animals of this cohort.
  • Hyenas: Hyenas are more dog than cat. Note that this dataset contains mostly spotted hyenas, but it does present some of the other species noted above. Hyenas are pack animals that prefer eating carrion rather than hunting, and their bodies have adapted as such. They resemble the cats in this cohort primarily in coat appearance.

A note on camouflage

We would be remiss if we didn’t pause to reflect on the fact that these animals have evolved camouflage in their appearance. In other words, they are supposed to be difficult to identify in their natural habitat. Many of the images in our dataset depict these animals in such settings. Interestingly many animals are colorblind, and in this lab, we deliberately use grayscale images.

References

Data Preparation

Imports and constants

In [2]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
from skimage import io
from sklearn import utils

CLASSIFICATIONS = { 'cheetah' : 0, 'jaguar' : 1, 'tiger' : 2, 'hyena' : 3}
IMAGE_HEIGHT = 400
IMAGE_WIDTH = 400
IMAGE_FLATTENED_LENGTH = IMAGE_HEIGHT * IMAGE_WIDTH
MAX_IMAGES_PER_CLASSIFICATION = 1000

Load the files

In [3]:
def get_classifier(string_name):
    assert string_name in CLASSIFICATIONS.keys()
    return CLASSIFICATIONS[string_name]

# Images are stored in the directory as follows:
# ./desktop/raw_images
#   -> <animal>_<tag>.jpg
base_dir = os.path.join('.', 'desktop', 'raw_images')
image_files = os.listdir(base_dir)
num_images_files = len(image_files)

images = np.empty(shape=(num_images_files,IMAGE_FLATTENED_LENGTH), dtype=float)
classifications = np.empty(num_images_files, dtype=int)

processed = {}
for classifier in CLASSIFICATIONS.values():
    processed[classifier] = 0

print("Loading images...")
image_index = 0
for image_file in image_files:
    
    if image_index % 300 == 0:
        print("  {} images processed".format(image_index))
        
    classifier = get_classifier(image_file.split('_')[0]) # 0 - animal, 1 - unimportant parts
    if processed[classifier] < MAX_IMAGES_PER_CLASSIFICATION:
        try:
            image = io.imread(os.path.join(base_dir, image_file), as_gray=True)
            images[image_index] = image.flatten()
            classifications[image_index] = classifier
            image_index += 1
            processed[classifier] += 1
        except Exception as ex:
            print("Skipping " + image_file + " " + str(ex))

num_images_processed = sum(processed.values())
print("Done")
print("{} images available".format(num_images_processed))

# Remove unused elements
images = np.delete(images, slice(num_images_processed,num_images_files), 0)
classifications = np.delete(classifications, slice(num_images_processed,num_images_files), 0)
Loading images...
  0 images processed
  300 images processed
  600 images processed
  900 images processed
  1200 images processed
  1500 images processed
  1800 images processed
  2100 images processed
  2400 images processed
  2700 images processed
  3000 images processed
  3300 images processed
  3600 images processed
  3900 images processed
Done
4000 images available
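
As a sanity check on the loading step, the per-class counts can be verified directly from the label array. This sketch uses a hypothetical stand-in array (`classifications_demo`) in place of the real `classifications` built above, which should likewise show 1,000 images per class:

```python
import numpy as np

# Hypothetical stand-in for the real `classifications` array: four
# balanced classes of 1000 labels each, as the dataset description states.
classifications_demo = np.repeat([0, 1, 2, 3], 1000)

# np.unique with return_counts gives the number of images per class.
values, counts = np.unique(classifications_demo, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
# {0: 1000, 1: 1000, 2: 1000, 3: 1000}
```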

Visualize a sample of the data

In [4]:
from ipywidgets import widgets

def image_flatten(image_nd_array):
    return image_nd_array.flatten()

def image_explode(image_1d_array):
    return image_1d_array.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
    
def get_random_indexes(num_indexes, array):
    """Returns a set of 4 unique random indexes over the input array"""
    rands = set()
    while len(rands) < num_indexes:
        rands.add(random.randint(0, len(array)-1))
    return rands
        
def show_random_images(num_to_show, images, caption):
    """Displays four random images from the input list"""
    rands = get_random_indexes(num_to_show, images)
    show_images(rands, images, caption)
    
def show_images(image_indexes, images, caption, label_image=False):
    """Show the images that correspond to the input indexes"""
    fig = plt.figure(figsize=(15, 4.6))
    fig.suptitle(caption)
    num_images_to_show = len(image_indexes)
    for plot_index in range(num_images_to_show):
        pic_index = image_indexes.pop()
        plt.subplot(1, num_images_to_show, plot_index + 1)
        plt.imshow(image_explode(images[pic_index]), cmap=plt.cm.gray)
        if label_image:
            plt.title("Image " + str(plot_index))

show_random_images(5, images, "Sample of images loaded")

Data Reduction

PCA with 1300 components

In [5]:
X = images

n_samples, n_features = X.shape
h, w = IMAGE_HEIGHT, IMAGE_WIDTH
n_classes = len(CLASSIFICATIONS)

print(np.sum(~np.isfinite(X)))
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
print("Original Image Sizes {} by {}".format(h,w))
0
n_samples: 4000
n_features: 160000
n_classes: 4
Original Image Sizes 400 by 400
In [6]:
from sklearn.decomposition import PCA

n_components = 1300
print ("Extracting the top %d eigencats from %d cats" % (
    n_components, X.shape[0]))

pca = PCA(n_components=n_components)
%time pca.fit(X.copy())
eigencats_1300 = pca.components_.reshape((n_components, 400, 400))
Extracting the top 1300 eigencats from 4000 cats
CPU times: user 13min 48s, sys: 1min 19s, total: 15min 8s
Wall time: 4min 12s

We used 1300 components because that was the number needed to reach the 0.9 benchmark we established in the business section, using the explained variance ratio as a proxy for achievable accuracy. The benchmark is considered reached when the cumulative explained variance ratio exceeds 0.9, as shown in the section titled "Explained Variance Ratio for PCA at 1300 and 2000 components".
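
The component count needed to hit a cumulative explained variance target can also be found programmatically. A minimal sketch on small synthetic stand-in data (the lab itself would apply this to the 4000 x 160000 image matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the flattened image matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))

pca = PCA().fit(X_demo)
cum_evr = np.cumsum(pca.explained_variance_ratio_)

# searchsorted returns the first index where cum_evr >= 0.9; +1 converts
# the 0-based index into a component count.
n_needed = int(np.searchsorted(cum_evr, 0.9)) + 1
print(n_needed, cum_evr[n_needed - 1])
```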

In [7]:
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        #plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

plot_gallery(X, "", 400, 400) 
In [8]:
eigencats_titles = ["eigencat %d" % i for i in range(eigencats_1300.shape[0])]
plot_gallery(eigencats_1300, eigencats_titles, 400, 400)

From the plot of the first 18 components we observe that their complexity increases as we consider more of them. This straightforward visualization reveals a characteristic of PCA: to reproduce complex patterns and high variability in the data, more numerous and more complex components are needed (as will be shown in the explained variance section).

In [9]:
def reconstruct_image(trans_obj,org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
    
idx_to_reconstruct = 1    
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X_idx.reshape(1, -1))
In [10]:
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA')
plt.grid(False)

We observe that the image reconstruction using 1300 components reproduces the characteristic features of the original image well. In fact, the distinctive fur features of the cheetah, the "teardrop" face marks and the dotted coat, are well captured.

PCA with 2000 components

In [11]:
from sklearn.decomposition import PCA

n_components = 2000
print ("Extracting the top %d eigencats from %d cats" % (
    n_components, X.shape[0]))

pca = PCA(n_components=n_components)
%time pca.fit(X.copy())
eigencats_2000 = pca.components_.reshape((n_components, 400, 400))
Extracting the top 2000 eigencats from 4000 cats
CPU times: user 21min 45s, sys: 3min 8s, total: 24min 54s
Wall time: 6min 57s

We increased to 2000 components because the images vary greatly from one another. There are four different animals, and some images focus solely on the animals' faces while others include the entire body. This increase in the number of components comes at very little cost: the dimensionality is still reduced to 1.25% of the original image dimension (2000 of 160000 features), a 98.75% reduction.
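
The quoted reduction can be checked with simple arithmetic:

```python
# 2000 PCA components versus the 400 * 400 = 160000 raw pixel features.
original_dim = 400 * 400
reduced_dim = 2000
fraction_kept = reduced_dim / original_dim
print(fraction_kept)  # 0.0125, i.e. 1.25% of the original dimensionality
```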

In [12]:
eigencats_titles = ["eigencat %d" % i for i in range(eigencats_2000.shape[0])]
plot_gallery(eigencats_2000, eigencats_titles, 400, 400)
In [13]:
def reconstruct_image(trans_obj,org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
    
idx_to_reconstruct = 1    
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image_2000 = reconstruct_image(pca,X_idx.reshape(1, -1))
In [14]:
plt.figure(figsize=[18,6])
plt.subplot(1,3,1)
plt.imshow(X_idx.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,3,2)
plt.imshow(reconstructed_image.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA \n 1300 Components')
plt.grid(False)
plt.subplot(1,3,3)
plt.imshow(reconstructed_image_2000.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA \n 2000 Components')
plt.grid(False)

The overall reconstruction using 2000 components does not drastically improve on 1300 components, which already provide a faithful representation of the original image. However, we notice a slight improvement in capturing the edge that separates the animal from the background and some of the features (the spots, the tear drops, etc.).

Explained Variance Ratio for PCA at 1300 and 2000 components

In [15]:
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
        
pca = PCA(n_components=1300)
X_pca = pca.fit(X)
plot_explained_variance(pca)

We used 1300 components to reach a 0.9 explained variance ratio for PCA. At 1300 components the explained variance ratio is 0.9016735, showing that information loss is minimal (<10%) at this number of components. Such high fidelity in the reconstruction can help a classification analysis in two ways:

- without the need to use the full-dimension image, the computational cost, and therefore the run time, will be greatly reduced;

- it will reduce the sparsity of the data, which is an issue when working with machine learning classification algorithms, potentially increasing the accuracy of the classification model.
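
A minimal sketch of the first point, on synthetic stand-in data: downstream classifiers operate on the much smaller transformed matrix rather than the raw pixels (the lab's real matrices are 4000 x 160000 reduced to 4000 x 1300).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the flattened image matrix.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 400))

# fit_transform projects each sample onto the principal components,
# shrinking the feature dimension from 400 to 30.
X_reduced = PCA(n_components=30).fit_transform(X_demo)
print(X_demo.shape, "->", X_reduced.shape)  # (100, 400) -> (100, 30)
```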

In [16]:
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
        
pca = PCA(n_components=2000)
X_pca = pca.fit(X)
plot_explained_variance(pca)

When 2000 components are chosen, the explained variance ratio increases to 0.9481316, meaning even less information loss, which is helpful for the reasons explained above. As mentioned, this increase comes at little cost, as the dimensionality is still reduced to 1.25% of the original size.

Randomized PCA with 1300 components

In [17]:
n_components = 1300
print ("Extracting the top %d eigencats from %d cats" % (
    n_components, X.shape[0]))

rpca = PCA(n_components=n_components, svd_solver='randomized')
%time rpca.fit(X.copy())
eigencats_1300_rpca = rpca.components_.reshape((n_components, 400, 400))
Extracting the top 1300 eigencats from 4000 cats
CPU times: user 13min 6s, sys: 1min 27s, total: 14min 33s
Wall time: 4min 32s
In [18]:
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        #plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

plot_gallery(X, "", 400, 400)
In [19]:
eigencats_titles = ["eigencat %d" % i for i in range(eigencats_1300_rpca.shape[0])]
plot_gallery(eigencats_1300_rpca, eigencats_titles, 400, 400)
In [20]:
def reconstruct_image(trans_obj,org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
    
idx_to_reconstruct = 1    
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image_rpca = reconstruct_image(rpca,X_idx.reshape(1, -1))
In [21]:
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,2,2)
plt.imshow(reconstructed_image_rpca.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA')
plt.grid(False)

Randomized PCA with 2000 components

In [22]:
n_components = 2000
print ("Extracting the top %d eigencats from %d cats" % (
    n_components, X.shape[0]))

rpca = PCA(n_components=n_components, svd_solver='randomized')
%time rpca.fit(X.copy())
eigencats_2000_rpca = rpca.components_.reshape((n_components, 400, 400))
Extracting the top 2000 eigencats from 4000 cats
CPU times: user 21min 53s, sys: 3min 2s, total: 24min 56s
Wall time: 7min 16s
In [23]:
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        #plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

plot_gallery(X, "", 400, 400)
In [24]:
eigencats_titles = ["eigencat %d" % i for i in range(eigencats_2000_rpca.shape[0])]
plot_gallery(eigencats_2000_rpca, eigencats_titles, 400, 400)
In [25]:
def reconstruct_image(trans_obj,org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
    
idx_to_reconstruct = 1    
X_idx = X[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image_rpca_2000 = reconstruct_image(rpca,X_idx.reshape(1, -1))
In [26]:
plt.figure(figsize=[18,6])
plt.subplot(1,3,1)
plt.imshow(X_idx.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Original')
plt.grid(False)
plt.subplot(1,3,2)
plt.imshow(reconstructed_image_rpca.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA \n 1300 Components')
plt.grid(False)
plt.subplot(1,3,3)
plt.imshow(reconstructed_image_rpca_2000.reshape((400, 400)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA \n 2000 Components')
plt.grid(False)

As with full PCA, the reconstruction from randomized PCA with 2000 components does not drastically improve on 1300 components, which already provide a faithful representation of the original image. However, we again notice a slight improvement in capturing the edge that separates the animal from the background and some of the features (the spots, the tear drops, etc.).

Explained Variance Ratio for Randomized PCA at 1300 and 2000 components

In [27]:
def plot_explained_variance(rpca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = rpca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
        
rpca = PCA(n_components=1300, svd_solver='randomized')
X_rpca = rpca.fit(X)
plot_explained_variance(rpca)

We used 1300 components to reach a 0.9 explained variance ratio for Randomized PCA. At 1300 components the explained variance ratio is 0.901674.

In [28]:
def plot_explained_variance(rpca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = rpca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
        
rpca = PCA(n_components=2000, svd_solver='randomized')
X_rpca = rpca.fit(X)
plot_explained_variance(rpca)

When 2000 components are chosen for Randomized PCA, the explained variance ratio increases to 0.9481402.

Comparing PCA and Randomized PCA

In [38]:
wall_time = [252, 272]
bars = ('PCA', 'RPCA')
y_pos = np.arange(len(bars))
plt.bar(y_pos, wall_time, color=['red','blue'])
plt.xticks(y_pos, bars)

plt.title('Wall time for PCA and Randomized PCA at 1300 components')
plt.xlabel('Analysis Method')
plt.ylabel('Wall time (seconds)')

plt.show()
In [39]:
wall_time = [417, 436]
bars = ('PCA', 'RPCA')
y_pos = np.arange(len(bars))
plt.bar(y_pos, wall_time, color=['red','blue'])
plt.xticks(y_pos, bars)

plt.title('Wall time for PCA and Randomized PCA at 2000 components')
plt.xlabel('Analysis Method')
plt.ylabel('Wall time (seconds)')

plt.show()
In [40]:
#code from "https://stackoverflow.com/questions/28931224/adding-value-labels-on-a-matplotlib-bar-chart"

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Bring some raw data.
evr = [0.9016735, 0.901674, 0.9481316, 0.9481402]
evr_series = pd.Series(evr)  # pd.Series.from_array is deprecated

x_labels = ['PCA, n=1300', 'Randomized PCA, n=1300', 'PCA, n=2000', 'Randomized PCA, n=2000']

# Plot the figure.
plt.figure(figsize=(12, 8))
ax = evr_series.plot(kind='bar')
ax.set_title('Explained Variance Ratio for PCA and Randomized PCA at 1300 and 2000 components')
ax.set_xlabel('Analysis Method with number of components')
ax.set_ylabel('Explained Variance Ratio')
ax.set_xticklabels(x_labels)


def add_value_labels(ax, spacing=5):
    """Add labels to the end of each bar in a bar chart.

    Arguments:
        ax (matplotlib.axes.Axes): The matplotlib object containing the axes
            of the plot to annotate.
        spacing (int): The distance between the labels and the bars.
    """

    # For each bar: Place a label
    for rect in ax.patches:
        # Get X and Y placement of label from rect.
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2

        # Number of points between bar and label. Change to your liking.
        space = spacing
        # Vertical alignment for positive values
        va = 'bottom'

        # If value of bar is negative: Place label below bar
        if y_value < 0:
            # Invert space to place label below
            space *= -1
            # Vertically align label at top
            va = 'top'

        # Use Y value as label and format number with seven decimal places
        label = "{:.7f}".format(y_value)

        # Create annotation
        ax.annotate(
            label,                      # Use `label` as label
            (x_value, y_value),         # Place label at end of the bar
            xytext=(0, space),          # Vertically shift label by `space`
            textcoords="offset points", # Interpret `xytext` as offset in points
            ha='center',                # Horizontally center label
            va=va)                      # Vertically align label differently for
                                        # positive and negative values.


# Call the function above. All the magic happens there.
add_value_labels(ax)

plt.show()

The first bar graph compares the amount of wall time needed to perform PCA and Randomized PCA at 1300 components. The bar graph shows that PCA needed less wall time to perform its analysis; PCA needed 252 seconds while randomized PCA needed 272 seconds.

The second bar graph compares the amount of wall time needed to perform PCA and Randomized PCA at 2000 components. The bar graph shows that PCA needed less wall time to perform its analysis; PCA needed 417 seconds while randomized PCA needed 436 seconds.

With such high-dimensional data, we were surprised that PCA was faster than Randomized PCA, since Randomized PCA is expected to be faster with little loss of accuracy.

The third bar graph compares the explained variance ratio for PCA and Randomized PCA at 1300 and 2000 components. Using 1300 components allowed both analysis methods to reach the 0.9 benchmark stated in the business section. Due to the variety of images in the dataset, and to perform the best analyses possible, the analyses and image visualizations were done with 2000 components. The explained variance ratio for PCA was 0.9016735 at 1300 components and 0.9481316 at 2000 components; for Randomized PCA it was 0.901674 at 1300 components and 0.9481402 at 2000 components.

In all previous runs of this notebook, Randomized PCA had a slightly longer wall time at 1300 components but a shorter wall time than PCA at 2000 components. This suggested that Randomized PCA scales better as the number of components increases, which is the behavior we want from an analysis method. That did not happen in this final run of the notebook, for reasons unknown, which is surprising nonetheless.

In terms of the explained variance ratio, Randomized PCA had a marginally higher ratio than PCA at both 1300 and 2000 components, indicating that it captured slightly more of the variance across images.

The quantitative and qualitative analyses above have led us to the conclusion that Randomized PCA is marginally better for this data set. However, due to the comparable performances we do not suggest one of the two methods over the other.
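
The wall-time comparison above can be reproduced in miniature. The sketch below uses a small synthetic matrix (the sizes are arbitrary stand-ins, not the lab's 4000 x 160000 data), so absolute times will not match those reported, and on data this small the two solvers may be close:

```python
import time
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 500))

# Fit with the full SVD solver and the randomized solver, timing each.
for solver in ("full", "randomized"):
    start = time.perf_counter()
    fitted = PCA(n_components=50, svd_solver=solver, random_state=0).fit(X_demo)
    elapsed = time.perf_counter() - start
    print(solver, round(elapsed, 3), "s", fitted.components_.shape)
```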

Feature extraction: Gabor Kernel

Background

Gabor kernels are filters used in image processing for pattern extraction, texture analysis, edge detection, etc. They are composed of a 2D Gaussian distribution and a periodic sinusoidal function. Their product results in a function that has the direction and frequency of the sinusoid but is localized in space by the Gaussian component. The corresponding formula is:

$$ g(x,y,\lambda,\theta,\psi,\sigma,\gamma) = \exp\left ( -\frac{x^{'2}+\gamma^{2}y^{'2}}{2\sigma ^{2}} \right ) \exp\left ( i\left ( 2\pi \frac{x^{'}}{\lambda } + \psi \right ) \right ) $$

Reading material used for this section: https://medium.com/@anuj_shah/through-the-eyes-of-gabor-filter-17d1fdb3ac97
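
The formula can be transcribed directly (real part only). Note that this is a plain transcription for illustration; implementations such as skimage's `gabor_kernel` additionally normalize by the Gaussian area, so values differ by a constant factor.

```python
import numpy as np

def gabor_real(x, y, lam, theta, psi, sigma, gamma):
    # Rotated coordinates x', y' used in the formula above.
    xp = x * np.cos(theta) + y * np.sin(theta)
    yp = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xp**2 + gamma**2 * yp**2) / (2 * sigma**2))  # Gaussian
    carrier = np.cos(2 * np.pi * xp / lam + psi)                     # sinusoid
    return envelope * carrier

# At the origin the envelope is 1 and the carrier is cos(psi) = 1 for psi = 0.
print(gabor_real(0.0, 0.0, lam=10.0, theta=0.0, psi=0.0, sigma=3.0, gamma=1.0))  # 1.0
```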

Parameters

In this project we are going to fine-tune the following parameters:

  • $\Theta$: controls the orientation of the Gaussian-localized sinusoidal function.
  • $\sigma$: controls the spread of the Gaussian function, hence the magnitude of the localization.
  • Frequency: expressed as the inverse of the wavelength, $\lambda$, controls the spacing between two waves.

Convolution

The images are processed by sliding the kernel across each pixel of the original image. The kernel is centered on the pixel to be processed, each element of the kernel is multiplied by the corresponding element of the image, and the results are summed to give the value of the new pixel. This process is called convolution.

Reading material used for this section: https://towardsdatascience.com/types-of-convolution-kernels-simplified-f040cb307c37
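
The convolution described above can be checked by hand at a single interior pixel of a toy image. One detail worth noting: `scipy.ndimage.convolve` (used in the cells below) performs true convolution, i.e. it flips the kernel, so the manual sum uses the mirrored kernel.

```python
import numpy as np
from scipy import ndimage as ndi

# Toy stand-in image and kernel.
rng = np.random.default_rng(3)
image = rng.normal(size=(5, 5))
kernel = rng.normal(size=(3, 3))

# Manual convolution at pixel (2, 2): centre the mirrored kernel on the
# pixel, multiply element-wise over the 3x3 neighbourhood, and sum.
patch = image[1:4, 1:4]
manual = np.sum(patch * kernel[::-1, ::-1])

# scipy's result at the same interior pixel (boundary mode is irrelevant here).
auto = ndi.convolve(image, kernel, mode='wrap')[2, 2]
print(np.isclose(manual, auto))  # True
```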

In [29]:
from skimage.filters import gabor_kernel
from scipy import ndimage as ndi
from scipy import stats
            
# compute the filter bank and take statistics of image
def compute_gabor(row, kernels, shape):
    feats = np.zeros((len(kernels), 4), dtype=np.double)
    for k, kernel in enumerate(kernels):
        filtered = ndi.convolve(row.reshape(shape), kernel, mode='wrap')
        _,_,feats[k,0],feats[k,1],feats[k,2],feats[k,3] = stats.describe(filtered.reshape(-1))
        
    return feats

# Returns the filtered image
def gabor_effect(row, kernels, shape):
    effect = np.zeros((len(kernels), 400, 400), dtype=np.double)
    for k, kernel in enumerate(kernels):
        filtered = ndi.convolve(row.reshape(shape), kernel, mode='wrap')
        effect[k] = filtered
        
    return effect

Parameters selection

In order to reduce the search space for the filter values that give the best performance on our "big cats" dataset, we performed a manual optimization guided by visual inspection of the effect of the various Gabor filter parameters. The minimum number of filters that should be considered is four, corresponding to four $\Theta$ values; reducing further would lose directional information that is critical to correctly identifying patterns.

In [30]:
# prepare filter bank kernels
kernels = []
kernel_label = []
sigma = [1, 3, 6]
frequency = [0.1, 0.4]
for s in sigma:
    for f in frequency:

        for theta in range(4):
            theta = theta / 4. * np.pi
    
            kernel = np.real(gabor_kernel(f, theta=theta, sigma_x=s, sigma_y=s))
            kernel_label.append([f,s,theta])
            kernels.append(kernel)

gabor_feature = gabor_effect(images[3200], kernels, (IMAGE_HEIGHT,IMAGE_WIDTH))
plt.figure(figsize=[16,16])
for i in range(1, 25):
    plt.subplot(6, 4, i)
    plt.imshow(gabor_feature[i-1], cmap="gray")
    plt.ylabel('Frequency: '+str(kernel_label[i-1][0])+', $\sigma$: '+str(kernel_label[i-1][1]))
    plt.xlabel('$\Theta$: '+str(kernel_label[i-1][2]))
    plt.suptitle('Gabor parameters effect in image reconstruction', fontsize=20)
    plt.tight_layout()

As can be seen in the figures above, a frequency of 0.4 completely prevents our filter from extracting relevant patterns from the image. Increasing the frequency usually results in capturing finer details; our hypothesis is that, due to the size and complexity of the images in our dataset, such a frequency causes the kernel to always produce high values regardless of shape, hence the occlusion. Regarding $\sigma$, we observe optimal recognition of the tiger patterns at $\sigma$=3: tiger stripes of different orientations are clearly highlighted by different kernels. At lower $\sigma$ values the image is fully reconstructed and no pattern is highlighted; at higher $\sigma$ values the abstraction of the pattern is emphasized at the cost of being able to distinguish the subject in the picture.

For this reason, with the aim of extracting image patterns while still correctly reconstructing the image, a frequency of 0.1, a $\sigma$ of 3, and 4 different $\Theta$ values are used to build our kernels in this study.

We want to point out that intermediate values of $\sigma$ between 3 and 6 might represent a "sweet spot" producing the optimal balance between image reconstruction and pattern extraction. Such an investigation would need to be coupled with a classification analysis for each combination, which cannot be achieved in this study due to the computational cost of that approach.

What do the kernels look like?

The kernels used for further analysis are visualized below. We clearly see the orientation of the kernel matrices, with the peaks and valleys of the Gaussian distributions in white and black, respectively.

In [31]:
# Parameters used in this project
sigma = 3 
frequency = 0.1

# Building the final kernels
kernels = []
kernel_label = []
for theta in range(4):
    theta = theta / 4. * np.pi    
    kernel = np.real(gabor_kernel(frequency, theta=theta, sigma_x=sigma, sigma_y=sigma))
    kernels.append(kernel)
    kernel_label.append([frequency, sigma, theta])
    
# Visualization of final kernels
plt.figure(figsize=[9,9])
for i in range(1, 5):
    plt.subplot(4,4,i)
    plt.imshow(kernels[i-1], cmap="gray")
    plt.xlabel('$\Theta$: '+str(kernel_label[i-1][2]))
    
plt.suptitle("Kernels at various orientation with $\sigma$=3 and frequency=0.1", fontsize=20)
plt.tight_layout()

Visualization of the Gabor kernels' effect on different cats

A visualization with representative images for each class demonstrates that our Gabor kernels clearly recognize specific patterns in the animals' fur. This is particularly clear when looking at the stripes of the tiger, which, as mentioned, are strongly highlighted by different kernels. The complexity of the dotted patterns is also well distinguished when comparing the effects of different kernels side by side.

In [32]:
gabr_ex_cheetah = gabor_effect(images[1], kernels, (IMAGE_HEIGHT,IMAGE_WIDTH))
gabr_ex_hyena = gabor_effect(images[1200], kernels, (IMAGE_HEIGHT,IMAGE_WIDTH))
gabr_ex_jaguar = gabor_effect(images[2200], kernels, (IMAGE_HEIGHT,IMAGE_WIDTH))
gabr_ex_tiger = gabor_effect(images[3200], kernels, (IMAGE_HEIGHT,IMAGE_WIDTH))

ex_cats_filtered = [ gabr_ex_cheetah, gabr_ex_hyena, gabr_ex_jaguar, gabr_ex_tiger]

plt.figure(figsize=[16,16])

indx=1
for cat in ex_cats_filtered:
    for i in range(1, 5):
        plt.subplot(4,4,indx)
        plt.imshow(cat[i-1], cmap="gray")
        plt.ylabel('Frequency: '+str(kernel_label[i-1][0])+', $\sigma$: '+str(kernel_label[i-1][1]))
        plt.xlabel('$\Theta$: '+str(kernel_label[i-1][2]))
        indx += 1
        
plt.suptitle('Response of different cats to the Gabor kernels', fontsize=20)
plt.tight_layout()

Apply Gabor to the entire dataset

In [33]:
# takes ~15 minutes (wall time) to run on the entire dataset
%time gabor_stats_not_flat = np.apply_along_axis(compute_gabor, 1, images, kernels, (IMAGE_HEIGHT,IMAGE_WIDTH))
gabor_stats = gabor_stats_not_flat.reshape(len(images), len(kernels)*4)
CPU times: user 1h 32min 14s, sys: 45 s, total: 1h 32min 59s
Wall time: 15min 40s
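The helper compute_gabor is defined earlier in the notebook and not shown in this section. A minimal pure-NumPy sketch consistent with how its output is used below (an array of shape (n_kernels, 4) per image, where index 0 is the mean and index 1 the standard deviation of the filtered image; the remaining two statistics are our assumption) could look like:

```python
import numpy as np

def compute_gabor_sketch(flat_image, kernels, shape):
    """Hypothetical stand-in for the notebook's compute_gabor helper: filter
    the image with every kernel and keep four summary statistics per kernel.
    Index 0 (mean) and index 1 (std) match later cells; min/max are assumed."""
    img = flat_image.reshape(shape)
    stats = np.empty((len(kernels), 4))
    for k, kernel in enumerate(kernels):
        # Circular FFT convolution: a cheap pure-NumPy approximation of
        # 'same'-mode filtering, adequate for whole-image statistics
        response = np.real(np.fft.ifft2(np.fft.fft2(img) *
                                        np.fft.fft2(kernel, s=img.shape)))
        stats[k] = [response.mean(), response.std(),
                    response.min(), response.max()]
    return stats

# Shape check on a toy image and two toy kernels
toy = np.random.default_rng(0).random(64)
toy_kernels = [np.ones((3, 3)) / 9, np.eye(3) / 3]
assert compute_gabor_sketch(toy, toy_kernels, (8, 8)).shape == (2, 4)
```

Applied over all 4,000 images this yields the (4000, n_kernels, 4) array that is then flattened to (4000, n_kernels*4) for the distance computation.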
In [34]:
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix_gabor = pairwise_distances(gabor_stats)
CPU times: user 1.18 s, sys: 57.2 ms, total: 1.23 s
Wall time: 236 ms

Gabor validation

Our validation consists of four parts:

  • Qualitative visual inspection of the closest images in the distance matrix constructed from the statistics of the Gabor-filtered images.
  • A KNN classification analysis.
  • Class-based visualization of the distance matrix via heatmap. We expect images in the same class (of the same animal) to have small distances, while distances between classes of cats that do not share similar features are maximized.
  • Analysis of each class's response to the different kernels.

Qualitative validation

In [35]:
from ipywidgets import fixed
import copy
from skimage.io import imshow

# put it together inside a nice widget
def closest_image(dmat,idx1):
    # NOTE: this will not suffice for evaluation of a nearest neighbor classifier for your lab assignment
    distances = copy.deepcopy(dmat[idx1,:]) # get all image distances
    distances[idx1] = np.inf # don't pick the same image!
    idx2 = np.argmin(distances)
    
    distances[idx2] = np.inf
    idx3 = np.argmin(distances)
    
    plt.figure(figsize=(10,16))
    plt.subplot(1,3,1)
    imshow(images[idx1].reshape((IMAGE_HEIGHT,IMAGE_WIDTH)))
    plt.title("Original Image ")
    plt.grid()

    plt.subplot(1,3,2)
    imshow(images[idx2].reshape((IMAGE_HEIGHT,IMAGE_WIDTH)))
    plt.title("Closest Image  ")
    plt.grid()
    
    plt.subplot(1,3,3)
    imshow(images[idx3].reshape((IMAGE_HEIGHT,IMAGE_WIDTH)))
    plt.title("Next Closest Image ")
    plt.grid()
    
In [36]:
from ipywidgets import widgets
widgets.interact(closest_image,idx1=(0,4000-1,1),dmat=fixed(dist_matrix_gabor),__manual=True)
Out[36]:
<function __main__.closest_image(dmat, idx1)>

A visual investigation of the closest images shows satisfactory performance of our distance analysis based on the statistics of the Gabor-filtered image distributions. We noticed that besides fur pattern, which is our intended feature target for distinguishing cats, small distances are achieved when the pose of the animal is similar, or when there is a comparable amount of natural background, which is then probably picked up as most influential in the distance calculation.

Quantitative validation

KNN validation

In [49]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(gabor_stats, classifications)
y_pred = knn.predict(gabor_stats)

print("The classification accuracy is: "+str(metrics.accuracy_score(classifications, y_pred)))
The classification accuracy is: 0.60275

Our accuracy of roughly 60% is well above the 25% expected by chance for four balanced classes, showing that the Gabor featurization helps the classification algorithm. (Note that we predict on the same data used to fit the KNN, so this figure is optimistic.) However, our accuracy does not reach the target of 90% presented in our business case study. One cause may be the noise introduced by images where the background covers most of the frame. To resolve this potential issue, a pre-filtering of the images, either manual or using a pattern-recognition algorithm specifically trained to recognize natural habitats, could be performed. Secondly, different cats might present similar fur patterns or physical features (ears, nose, etc.), increasing the difficulty of the classification task.
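Since the KNN above is fit and scored on the same data, a held-out split gives a more honest estimate. A minimal 1-NN sketch in pure NumPy, with synthetic clustered features standing in for gabor_stats (all data here is synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the Gabor features: 4 well-separated classes, 16 dims
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 16)) for c in range(4)])
y = np.repeat(np.arange(4), 50)

# Shuffle, then hold out 25% of the samples for testing
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
split = int(0.75 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# 1-NN: each test point takes the label of its closest training point
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
y_pred = y_train[dists.argmin(axis=1)]
accuracy = (y_pred == y_test).mean()
print("Held-out accuracy:", accuracy)
```

On the real features this amounts to replacing `knn.fit(gabor_stats, ...)` / `knn.predict(gabor_stats)` with a train_test_split before fitting.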

Heatmap validation

The purpose of the heatmap validation is to verify that images in the same class are recognized as similar, while being highly distinct from other classes of animals. Such a visualization can be used to quantitatively rank similarity between the different "cat" species. An ideal heatmap would present the lowest values along the diagonal, representing similarity within a class, while maximizing the off-diagonal distances according to the feature diversity of the different cats. For the latter purpose, a 2D projection of the space explored by each class might provide a qualitative visualization of similarity and overlap between classes. Unfortunately, the dimensionality reduction performed in this study showed that two components are far from enough to correctly represent our data. For this reason, we do not show a 2D projection using our dimensionality-reduction method of choice, PCA, as it misses considerable information and would hence provide a potentially misleading visualization.
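The class-averaged heatmap relies on a reshape trick: viewing the 4000x4000 distance matrix as (class, member, class, member) blocks and averaging over the member axes. A toy check of the same trick on a known matrix:

```python
import numpy as np

# 2 classes with 3 members each -> a 6x6 "distance matrix" of known values
n_classes, per_class = 2, 3
D = np.arange((n_classes * per_class) ** 2, dtype=float).reshape(6, 6)

# Same pattern as dist_matrix_gabor.reshape(4, 1000, 4, 1000).mean(-1).mean(1)
block_means = D.reshape(n_classes, per_class, n_classes, per_class).mean(-1).mean(1)

# Entry (i, j) is the mean of the (i, j) class block of D
assert np.isclose(block_means[0, 1], D[0:3, 3:6].mean())
assert np.isclose(block_means[1, 0], D[3:6, 0:3].mean())
```

This works because the images are stored class-contiguously (1,000 cheetahs, then 1,000 hyenas, and so on), so each block of the reshaped matrix corresponds to one class pair.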

In [73]:
import seaborn as sns
labels = ['cheetah', 'hyena', 'jaguar', 'tiger']

dist_matrix_gabor_bin = dist_matrix_gabor.reshape(4, 1000, 4, 1000).mean(-1).mean(1) # 1000 images in each class for a total of 4 classes
sns.heatmap(dist_matrix_gabor_bin, xticklabels=labels, yticklabels=labels, annot=True)
Out[73]:
<AxesSubplot:>

The class-based heatmap visualization of the distance matrix shows that jaguars and tigers have the highest within-class similarity. Furthermore, our model recognizes high similarity between these two classes. On the other hand, cheetahs and hyenas have low within-class similarity and are identified as closer to jaguars and tigers, indicating that the latter have general feature patterns common to all the cats analyzed.

Which kernels have the most effect? Filter response

In order to understand the suitability of Gabor filters for our animal classification task, a crucial aspect is to verify how each class responds to the different kernel filters. To accomplish this, we investigated the mean and standard deviation of the filtered-image distributions for each class using bar charts. Summarizing the distributions this way, instead of visualizing them in full, decomposed their various characteristics and provided an intuitive quantitative measure of comparison.

In [110]:
#Define interval to bin classes
interval = int(len(images)/4)

#Bin stats into classes
gabr_stats_cheetah = gabor_stats_not_flat[:interval,:,:]
gabr_stats_hyena = gabor_stats_not_flat[interval:interval*2,:,:]
gabr_stats_jaguar = gabor_stats_not_flat[interval*2:interval*3,:,:]
gabr_stats_tiger = gabor_stats_not_flat[interval*3:,:,:]

gabr_stats_cats = [gabr_stats_cheetah, gabr_stats_hyena, gabr_stats_jaguar, gabr_stats_tiger]

#Plot histogram of stats - Mean
plt.figure(figsize=[20,10])

k = 1
for cs in gabr_stats_cats:
    plt.subplot(1,4,k)
    plt.title(labels[k-1], fontsize=16)
    
    for i in range(4):
        m = np.mean(cs[:,i,0], axis=0)
        plt.bar(x=i, height=m, label=str(kernel_label[i]))
        
    k += 1
    plt.ylim(0, 0.15)
    plt.legend(title='Kernel Freq., $\sigma$, $\Theta$', fontsize=14)

    
plt.suptitle('Mean response for different Cats using kernels', fontsize=20)
plt.tight_layout()

#Plot histogram of stats - STD
plt.figure(figsize=[20,10])

k = 1
for cs in gabr_stats_cats:
    plt.subplot(1,4,k)
    plt.title(labels[k-1],fontsize=16)
    
    for i in range(4):
        m = np.mean(cs[:,i,1], axis=0)
        plt.bar(x=i, height=m, label=str(kernel_label[i]))
        
    k += 1
    plt.ylim(0, 0.003)
    plt.legend(title='Kernel Freq., $\sigma$, $\Theta$', fontsize=14)
        
plt.suptitle('Standard Deviation for different Cats using kernels', fontsize=20)
plt.tight_layout()

We observe that the magnitudes of response to the various kernels follow the same trend across all classes. From the lower means and higher standard deviations, it can be inferred that jaguars and tigers cover a wider array of feature patterns. This could explain why the other cats show relatively high similarity to these two classes in the distance-matrix heatmap: given similar mean values, a higher standard deviation implies that these two classes cover a wide portion of the feature space. The narrower spreads of the cheetah and hyena distributions show higher selectivity in the feature space, which would, in an ideal scenario, provide a better basis for classification.

Is this good news for our Gabor filters?

Yes...and no.

Starting from the positive, this may indicate that tigers and jaguars naturally present a wide array of physical features that are common to other cats as well. On the other hand, we would have expected more variation in the trend of relative response to the various kernels in our set. If that were the case, information about a directional pattern that strongly identifies a given class of cats could have been used for further classification or to improve our feature extraction.

We attempt to achieve this task using a kernel-decomposed, classed-based heatmap analysis of the distance matrix presented below.

In [113]:
plt.figure(figsize=[20,5])

for i in range(4):
    plt.subplot(1,4,i+1)
    dist_matrix_kernel = pairwise_distances(gabor_stats_not_flat[:,i,:])  # avoid clobbering dist_matrix_gabor
    dist_matrix_kernel_bin = dist_matrix_kernel.reshape(4, 1000, 4, 1000).mean(-1).mean(1)
    sns.heatmap(dist_matrix_kernel_bin, xticklabels=labels, yticklabels=labels, annot=True, vmin=1, vmax=2.5)
    plt.title('Kernel Freq., $\sigma$, $\Theta$:  \n'+str(kernel_label[i]), fontsize=14)

As mentioned, to gain more information about directional patterns that strongly identify each class of cats, we decomposed the distance matrix for each kernel. As expected, distances in lower dimensions are shorter, so a direct comparison with the overall distance-matrix heatmap shown above is not pursued.

Our investigation of the diagonal elements resulted in the following:

  • Tigers are best "grouped together" using an angle corresponding to 135$^{\circ}$
  • Jaguars are best "grouped together" using an angle corresponding to 45$^{\circ}$
  • Cheetahs are best "grouped together" using an angle corresponding to 45$^{\circ}$ and 135$^{\circ}$
  • Hyenas are equivalently "grouped together" using all kernels except for the one corresponding to 0$^{\circ}$ (vertical).

To conclude, this analysis provided further details on the sensitivity of each class to a particular kernel, allowing a better understanding of the inner workings of Gabor filters and potentially providing grounds for further kernel optimization.

It is important to point out that, at the present moment and before a matching analysis, we cannot evaluate how much the natural environment in the images is polluting our classification AND our statistics. We also lack information on whether a particular class is more affected by this problem (i.e., more instances of images with a wide background for a certain class). Furthermore, this problem is aggravated when the background is the natural habitat: these animals evolved fur patterns that camouflage them in their natural habitat.

Exceptional Work (DAISY Implementation)

Initial Hyperparameters

As was suggested in the lecture, we start with the default hyperparameters. However, given a standard 400x400 pixel image, we attempt to find a step size that provides sufficient overlap while remaining efficient, noting that the default step size is 4 pixels. In other words, we are essentially exploring the step-size adjustment.

In [136]:
from skimage.feature import daisy
from skimage.color import rgb2gray
import matplotlib.pyplot as plt

test_image = images[get_random_indexes(1,images).pop()]
%time descriptors, visualization = daisy(image_explode(test_image), step=30, visualize=True)
num_descriptors = descriptors.shape[0] * descriptors.shape[1]
print('DAISY descriptors extracted: {} '.format(num_descriptors))

show_images([0,1], [image_flatten(test_image), image_flatten(rgb2gray(visualization))], 'Daisy descriptors, step size of 30')
Wall time: 3.46 s
DAISY descriptors extracted: 169 

From the above, we can see fairly good coverage, but depending on the image, we could be missing quite a bit of the animal. Decreasing the step size further toward the default makes the visualization unreadable and takes quite a while to render. So, let's see what happens when we skip the visualization.

In [137]:
for step in range(4,32,4):
    %time descriptors = daisy(image_explode(test_image), step=step, visualize=False)
    num_descriptors = descriptors.shape[0] * descriptors.shape[1]
    print('DAISY descriptors extracted: {} at step {}\n'.format(num_descriptors, step))
Wall time: 640 ms
DAISY descriptors extracted: 8649 at step 4

Wall time: 632 ms
DAISY descriptors extracted: 2209 at step 8

Wall time: 556 ms
DAISY descriptors extracted: 961 at step 12

Wall time: 539 ms
DAISY descriptors extracted: 576 at step 16

Wall time: 542 ms
DAISY descriptors extracted: 361 at step 20

Wall time: 550 ms
DAISY descriptors extracted: 256 at step 24

Wall time: 532 ms
DAISY descriptors extracted: 196 at step 28

From the above, we can see that increasing the step size would only save about 20 seconds over the whole 4,000-image dataset, while significantly reducing the number of descriptors. So, at the default step size we will have 93x93 keypoints, or 8,649 descriptors, which is still a significant reduction from the original 160,000 features we obtained by simply vectorizing the gray-scale pixels.
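The descriptor counts above follow directly from DAISY's geometry: keypoints sit on a grid with one sample every `step` pixels inside the margin left by the outer descriptor ring (radius 15 by default in skimage). A quick sanity check of that relationship, reproducing the measured counts:

```python
def daisy_grid_side(image_side, step, radius=15):
    """Keypoints per axis for a DAISY grid: one sample every `step` pixels
    inside the margin of `radius` pixels left on each border."""
    return (image_side - 2 * radius) // step + 1

# Reproduces the counts measured above for a 400x400 image
assert daisy_grid_side(400, 4) ** 2 == 8649    # default step -> 93x93 keypoints
assert daisy_grid_side(400, 30) ** 2 == 169    # step 30 -> 13x13 keypoints
```

So the descriptor count falls roughly with the square of the step size, which is why the step-30 visualization is so much sparser than the default.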

Exploring the DAISY operator

Here, we begin to explore the DAISY operations. We first make the assumption that, when we run the required brute-force descriptor-matching algorithm, more matches is better; i.e., comparing an image to itself yields a 100% match (verified to ensure code functionality, but not shown). We then check whether convolving the images before processing has any effect, which we theorize it won't. Finally, we run some tests where we take one image from a small sample and modify it to see if we can still find a match. As is clear in the images throughout this lab, our images vary considerably.
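The matching assumption can be illustrated with a minimal pure-NumPy version of mutual nearest-neighbour matching, the idea behind skimage's match_descriptors with cross-checking (this sketch is not the library implementation):

```python
import numpy as np

def match_descriptors_sketch(d1, d2):
    """Cross-checked nearest-neighbour matching: keep pairs (i, j) where j is
    the closest d2 row to d1[i] AND i is the closest d1 row to d2[j]."""
    # Pairwise Euclidean distances between the two descriptor sets
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    nn12 = dist.argmin(axis=1)                 # best match in d2 for each d1 row
    nn21 = dist.argmin(axis=0)                 # best match in d1 for each d2 row
    idx1 = np.arange(len(d1))
    mutual = nn21[nn12] == idx1                # keep only mutual best matches
    return np.column_stack([idx1[mutual], nn12[mutual]])

# Identical descriptor sets match one-to-one, so an image compared
# to itself yields a 100% match, as assumed above
d = np.random.default_rng(1).normal(size=(10, 8))
assert len(match_descriptors_sketch(d, d)) == 10
```

The cross-check is what keeps the raw match count meaningful as a similarity score: without it, every descriptor in one set would trivially produce a match.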

Checking the gradient matching theory

In [162]:
import copy

def get_keypoints_and_descriptors(image_list):
    keypoints_list = np.empty(len(image_list), dtype=object)
    for index, image in enumerate(image_list):
        descriptors = daisy(image_explode(image))
        descriptors = descriptors.reshape((descriptors.shape[0] * descriptors.shape[1], -1)) # Flatten
        keypoints_list[index] = descriptors
    return keypoints_list

# Just a random image
thirteen_unmodified = np.empty(shape=(1,IMAGE_FLATTENED_LENGTH), dtype=float)
thirteen_unmodified[0] = images[13]
fourteen_unmodified = np.empty(shape=(1,IMAGE_FLATTENED_LENGTH), dtype=float)
fourteen_unmodified[0] = images[14]

# Their gradient counterpart
thirteen_gradient = copy.deepcopy(thirteen_unmodified)
image = image_explode(thirteen_gradient[0])
image = np.sqrt(sobel_v(image)**2 + sobel_h(image)**2)
thirteen_gradient[0] = image_flatten(image)
fourteen_gradient = copy.deepcopy(fourteen_unmodified)
image = image_explode(fourteen_gradient[0])
image = np.sqrt(sobel_v(image)**2 + sobel_h(image)**2)
fourteen_gradient[0] = image_flatten(image)

show_images(set(np.arange(4)), [thirteen_unmodified[0], fourteen_unmodified[0], thirteen_gradient[0], fourteen_gradient[0]], 'match_descriptors comparison images')

# Get the matches for the unmodified images and then the same for the gradient images
%time match_unmodified = match_descriptors(get_keypoints_and_descriptors(thirteen_unmodified)[0], \
                                     get_keypoints_and_descriptors(fourteen_unmodified)[0])
match_gradient = match_descriptors(get_keypoints_and_descriptors(thirteen_gradient)[0], \
                                   get_keypoints_and_descriptors(fourteen_gradient)[0])

print("Number of matches on unmodified images: {}".format(len(match_unmodified)))
print("Number of matches on gradient images: {}".format(len(match_gradient)))
Wall time: 22.7 s
Number of matches on unmodified images: 335
Number of matches on gradient images: 441

From this we learned that the brute-force descriptor match takes about 20 seconds to execute. We also learned that applying a gradient before using the DAISY operator yielded more matches, which was counterintuitive to us but does make sense. The goal was to remove some of the extraneous imagery in the background and leave distinct lines around the animal. Simply looking at the pictures makes clear that this did not happen: we now have more dark area in the background, and hence more chances for a match.

Testing some images

Here, we have 5 fairly similar images of cheetahs. The 4th image (index=3) in the source set is also used in the test set, both in its original form and modified (note that the last test image is "flipped"). We should be able to find it each time.

In [168]:
from skimage.transform import resize
from skimage.filters import sobel_h, sobel_v

TEST_FILE_COUNT = 5
SOURCE_FILE_COUNT = 5

validation_source = np.empty(shape=(SOURCE_FILE_COUNT,IMAGE_FLATTENED_LENGTH), dtype=float)
validation_test = np.empty(shape=(TEST_FILE_COUNT,IMAGE_FLATTENED_LENGTH), dtype=float)

base_dir = os.path.join('.', 'daisy_images')
for index in range(SOURCE_FILE_COUNT):
    filename = "cheetah_source_" + str(index) + ".jpg"
    image = io.imread(os.path.join(base_dir, filename), as_gray=True)
    validation_source[index] = image.flatten()
for index in range(TEST_FILE_COUNT):
    filename = "cheetah_test_" + str(index) + ".jpg"
    image = io.imread(os.path.join(base_dir, filename), as_gray=True)
    image = resize(image, (IMAGE_HEIGHT, IMAGE_WIDTH))
    validation_test[index] = image.flatten()

show_images(set(np.arange(SOURCE_FILE_COUNT)), validation_source, "Source Images", True)
show_images(set(np.arange(TEST_FILE_COUNT)), validation_test, "Test Images", True)
In [180]:
def match_by_count(keypoints1, keypoints2):
    """Match by the simple count of matching descriptors found.  More matches = higher score = closer match"""
    match = match_descriptors(keypoints1, keypoints2)
    score = len(match)
    return (True, score)

def run_simulation(source_images, test_images):
    # Each element will be a list of keypoints which will in turn have a list of descriptors
    source_keypoints = get_keypoints_and_descriptors(source_images)
    test_keypoints = get_keypoints_and_descriptors(test_images)

    # This parallel list of test_images will contain the index of the best match
    # in the source.  -1 indicates no match
    test_matches = np.empty(len(test_images), dtype=int)
    test_matches.fill(-1)

    for test_index, test_keypoints_for_an_image in enumerate(test_keypoints):
        print("Processing test image {}".format(test_index))
        best_score = -1
        for source_index, source_keypoints_for_an_image in enumerate(source_keypoints):
            (match, match_score) = match_by_count(test_keypoints[test_index], source_keypoints[source_index])
            if match and best_score < match_score:
                test_matches[test_index] = source_index
                best_score = match_score
            
    for test_index, test_keypoints_for_an_image in enumerate(test_keypoints):
        print("Best match for test image {} is {}".format(test_index, test_matches[test_index]))
        
run_simulation(validation_source, validation_test)
Processing test image 0
Processing test image 1
Processing test image 2
Processing test image 3
Processing test image 4
Best match for test image 0 is 3
Best match for test image 1 is 3
Best match for test image 2 is 3
Best match for test image 3 is 3
Best match for test image 4 is 3

In the above, the DAISY operator with a brute-force matcher, using the number of matched descriptors as a closeness metric, was able to identify the proper images. Note that we used the (default) L2-norm metric to determine matching indices across the two sets of descriptors for each keypoint.

Running DAISY on the larger dataset

Now, we turn to the larger dataset. Since we know that a brute-force match takes about 20 seconds, we chose a subset size that would complete in a reasonable amount of time. We will use classification as the measure of success: if our test image is a cheetah, and the closest match selected from the source set is also a cheetah, then we consider the match a success.

In [207]:
# Split the data set into source and test images.  We will match the test images against the source

from sklearn.model_selection import train_test_split

DAISY_SOURCE_SIZE = 200
DAISY_TEST_SIZE = 1
source_images, test_images, source_class, test_class = train_test_split(images, classifications, test_size=DAISY_TEST_SIZE, train_size=DAISY_SOURCE_SIZE)
print("Source classifiers and counts " + str(np.unique(source_class, return_counts=True)))
print("Test classifiers and counts " + str(np.unique(test_class, return_counts=True)))

#show_images(set(np.arange(5)), source_images, "Source Images")
#show_images(set(np.arange(2)), test_images, "Test Images")
Source classifiers and counts (array([0, 1, 2, 3]), array([55, 45, 56, 44], dtype=int64))
Test classifiers and counts (array([0]), array([1], dtype=int64))
In [210]:
def get_classifier_name(int_value):
    for key in CLASSIFICATIONS.keys():
        if CLASSIFICATIONS[key] == int_value:
            return key

def run_classification(source_images, test_images):
    # Each element will be a list of keypoints which will in turn have a list of descriptors
    print("Generating keypoints")
    source_keypoints = get_keypoints_and_descriptors(source_images)
    test_keypoints = get_keypoints_and_descriptors(test_images)

    # This parallel list of test_images will contain the index of the best match
    # in the source.  -1 indicates no match
    test_matches = np.empty(len(test_images), dtype=int)
    test_matches.fill(-1)

    print()
    for test_index, test_keypoints_for_an_image in enumerate(test_keypoints):
        if test_index % 5 == 0:
            print("Processing test image {}".format(test_index))
        best_score = -1
        for source_index, source_keypoints_for_an_image in enumerate(source_keypoints):
            if source_index % 20 == 0:
                print("  Processing source image {}".format(source_index))
            (match, match_score) = match_by_count(test_keypoints[test_index], source_keypoints[source_index])
            if match and best_score < match_score:
                test_matches[test_index] = source_index
                best_score = match_score

    print()
    success = 0
    for test_index in range(len(test_matches)):
        source_index = test_matches[test_index]
        source_class_name = get_classifier_name(source_class[source_index])
        test_class_name = get_classifier_name(test_class[test_index])
        print("Best match for test image {} ({}) is source image {} ({})".format(test_index, test_class_name, source_index, source_class_name))
        if (source_class[source_index] == test_class[test_index]):
            success += 1

    print("Overall success rate {:.1%}".format(success/len(test_matches)))
    
run_classification(source_images, test_images)
Generating keypoints

Processing test image 0
  Processing source image 0
  Processing source image 20
  Processing source image 40
  Processing source image 60
  Processing source image 80
  Processing source image 100
  Processing source image 120
  Processing source image 140
  Processing source image 160
  Processing source image 180

Best match for test image 0 (cheetah) is source image 55 (cheetah)
Overall success rate 100.0%
In [211]:
source_images, test_images, source_class, test_class = train_test_split(images, classifications, test_size=DAISY_TEST_SIZE, train_size=DAISY_SOURCE_SIZE)
print("Source classifiers and counts " + str(np.unique(source_class, return_counts=True)))
print("Test classifiers and counts " + str(np.unique(test_class, return_counts=True)))
run_classification(source_images, test_images)
Source classifiers and counts (array([0, 1, 2, 3]), array([50, 50, 49, 51], dtype=int64))
Test classifiers and counts (array([1]), array([1], dtype=int64))
Generating keypoints

Processing test image 0
  Processing source image 0
  Processing source image 20
  Processing source image 40
  Processing source image 60
  Processing source image 80
  Processing source image 100
  Processing source image 120
  Processing source image 140
  Processing source image 160
  Processing source image 180

Best match for test image 0 (jaguar) is source image 147 (cheetah)
Overall success rate 0.0%
In [212]:
show_images(set(np.arange(2)), [test_images[0], source_images[147]], "Match failure example")

Several observations may be made about the above failure case. Both backgrounds are blurred, and we anticipate that several matches occur there. Also, note that the spots on this particular jaguar do not strongly exhibit the common rosette pattern and appear more spot-like. Finally, the profile around the nose reminds us of the tear-like markings of the cheetah.

In [215]:
source_images, test_images, source_class, test_class = train_test_split(images, classifications, test_size=DAISY_TEST_SIZE, train_size=DAISY_SOURCE_SIZE)
print("Source classifiers and counts " + str(np.unique(source_class, return_counts=True)))
print("Test classifiers and counts " + str(np.unique(test_class, return_counts=True)))
run_classification(source_images, test_images)
Source classifiers and counts (array([0, 1, 2, 3]), array([54, 50, 53, 43], dtype=int64))
Test classifiers and counts (array([3]), array([1], dtype=int64))
Generating keypoints

Processing test image 0
  Processing source image 0
  Processing source image 20
  Processing source image 40
  Processing source image 60
  Processing source image 80
  Processing source image 100
  Processing source image 120
  Processing source image 140
  Processing source image 160
  Processing source image 180

Best match for test image 0 (hyena) is source image 169 (tiger)
Overall success rate 0.0%
In [216]:
show_images(set(np.arange(2)), [test_images[0], source_images[169]], "Match failure example")

The matcher is likely keying in on the foliage in the background of the above images.

In [217]:
DAISY_SOURCE_SIZE=400

source_images, test_images, source_class, test_class = train_test_split(images, classifications, test_size=DAISY_TEST_SIZE, train_size=DAISY_SOURCE_SIZE)
print("Source classifiers and counts " + str(np.unique(source_class, return_counts=True)))
print("Test classifiers and counts " + str(np.unique(test_class, return_counts=True)))
run_classification(source_images, test_images)
Source classifiers and counts (array([0, 1, 2, 3]), array([103,  89,  95, 113], dtype=int64))
Test classifiers and counts (array([1]), array([1], dtype=int64))
Generating keypoints

Processing test image 0
  Processing source image 0
  Processing source image 20
  Processing source image 40
  Processing source image 60
  Processing source image 80
  Processing source image 100
  Processing source image 120
  Processing source image 140
  Processing source image 160
  Processing source image 180
  Processing source image 200
  Processing source image 220
  Processing source image 240
  Processing source image 260
  Processing source image 280
  Processing source image 300
  Processing source image 320
  Processing source image 340
  Processing source image 360
  Processing source image 380

Best match for test image 0 (jaguar) is source image 285 (jaguar)
Overall success rate 100.0%
In [219]:
show_images(set(np.arange(2)), [test_images[0], source_images[285]], "Match success example")

Here, we have two photos that are head-on shots with a fair amount of the jaguar's spots showing. In this example, we ran the simulation over 400 source images, divided pretty evenly, so we had a much better chance of actually having similar images in the subset.

Conclusion

From the above, we may conclude that a brute-force DAISY operator solution is not viable for our classification problem. It is very computationally intensive, and would become more so when run against modern tree-mounted wildlife camera pictures, which routinely run 20-100 megapixels. Keep in mind that these were 400x400 grayscale images.

Also note that we used a very simple measure of closeness: the number of matching descriptors. Further exploration in our domain would likely yield better responses. For example, we could select source images that are tightly cropped and do not have trees, etc. in the background; this could be a viable approach for our problem if we were looking for certain species. We could also weight clustering of matches to filter out some of the random noise. That is, we would expect a series of matches across indices in close proximity, not matches simply spread randomly throughout the images.
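The clustering idea could be sketched as follows: since DAISY keypoints lie on a regular grid, a flat descriptor index maps back to a grid location, and isolated matches can be discounted as likely noise. The function name, scoring rule, and radius below are all hypothetical, for illustration only:

```python
import numpy as np

def clustered_match_score(matches, grid_w, radius=2):
    """Hypothetical refinement of the raw match count: a match only scores if
    another match lands within `radius` grid cells of it in the source image,
    discounting isolated (likely spurious) matches."""
    # Convert flat keypoint indices in the source image to (row, col) grid coords
    coords = np.stack([matches[:, 1] // grid_w, matches[:, 1] % grid_w], axis=1)
    # Chebyshev distance between every pair of matched source keypoints
    d = np.abs(coords[:, None, :] - coords[None, :, :]).max(axis=2)
    np.fill_diagonal(d, radius + 1)            # a match cannot support itself
    return int((d.min(axis=1) <= radius).sum())

# Three adjacent matches count; the one isolated match is discarded
m = np.array([[0, 0], [1, 1], [2, 2], [3, 80]])
assert clustered_match_score(m, grid_w=13) == 3
```

Swapping this score into match_by_count would be one way to test whether proximity weighting reduces the background-driven mismatches seen above.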

Finally, given the time to execute (on a Surface Laptop), we were not able to validate our success criterion of a 90% match over even 5% of our dataset, so these results must be considered preliminary at best.

In [ ]: